AITopics | gpt-4 vision

Collaborating Authors

gpt-4 vision

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance

Baranyi, Gábor, Csibi, Zsolt, Fenech, Kristian, Fóthi, Áron, Gaál, Zsófia, Skaf, Joul, Lőrincz, András

arXiv.org Artificial IntelligenceJul-14-2025

This paper introduces the Ambient Intelligence Rehabilitation Support (AIRS) framework, an advanced artificial intelligence-based solution tailored for home rehabilitation environments. AIRS integrates cutting-edge technologies, including Real-Time 3D Reconstruction (RT-3DR), intelligent navigation, and large Vision-Language Models (VLMs), to create a comprehensive system for machine-guided physical rehabilitation. The general AIRS framework is demonstrated in rehabilitation scenarios following total knee replacement (TKR), utilizing a database of 263 video recordings for evaluation. A smartphone is employed within AIRS to perform RT-3DR of living spaces and has a body-matched avatar to provide visual feedback about the excercise. This avatar is necessary in (a) optimizing exercise configurations, including camera placement, patient positioning, and initial poses, and (b) addressing privacy concerns and promoting compliance with the AI Act. The system guides users through the recording process to ensure the collection of properly recorded videos. AIRS employs two feedback mechanisms: (i) visual 3D feedback, enabling direct comparisons between prerecorded clinical exercises and patient home recordings and (ii) VLM-generated feedback, providing detailed explanations and corrections for exercise errors. The framework also supports people with visual and hearing impairments. It also features a modular design that can be adapted to broader rehabilitation contexts. AIRS software components are available for further use and customization.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.08624

Country: Europe > Hungary (0.15)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry:

Health & Medicine > Health Care Providers & Services (1.00)
Health & Medicine > Therapeutic Area (0.88)
Information Technology (0.88)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Communications > Mobile (0.90)
(2 more...)

Add feedback

Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring

Ersoz, Ahmet Bahaddin

arXiv.org Artificial IntelligenceDec-20-2024

The integration of Large Vision-Language Models (LVLMs) such as OpenAI's GPT-4 Vision into various sectors has marked a significant evolution in the field of artificial intelligence, particularly in the analysis and interpretation of visual data. This paper explores the practical application of GPT-4 Vision in the construction industry, focusing on its capabilities in monitoring and tracking the progress of construction projects. Utilizing high-resolution aerial imagery of construction sites, the study examines how GPT-4 Vision performs detailed scene analysis and tracks developmental changes over time. The findings demonstrate that while GPT-4 Vision is proficient in identifying construction stages, materials, and machinery, it faces challenges with precise object localization and segmentation. Despite these limitations, the potential for future advancements in this technology is considerable. This research not only highlights the current state and opportunities of using LVLMs in construction but also discusses future directions for enhancing the model's utility through domain-specific training and integration with other computer vision techniques and digital twins.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2412.16108

Country:

Europe (0.47)
Asia > Middle East > Republic of Türkiye (0.46)

Genre: Research Report > New Finding (0.48)

Industry:

Construction & Engineering (1.00)
Materials > Construction Materials (0.98)
Health & Medicine > Therapeutic Area > Oncology (0.48)
Health & Medicine > Diagnostic Medicine > Imaging (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.38)

Add feedback

Temporal Grounding of Activities using Multimodal Large Language Models

Song, Young Chol

arXiv.org Artificial IntelligenceMay-30-2024

Temporal grounding of activities, the identification of specific time intervals of actions within a larger event context, is a critical task in video understanding. Recent advancements in multimodal large language models (LLMs) offer new opportunities for enhancing temporal reasoning capabilities. In this paper, we evaluate the effectiveness of combining image-based and text-based large language models (LLMs) in a two-stage approach for temporal activity localization. We demonstrate that our method outperforms existing video-based LLMs. Furthermore, we explore the impact of instruction-tuning on a smaller multimodal LLM, showing that refining its ability to process action queries leads to more expressive and informative outputs, thereby enhancing its performance in identifying specific time intervals of activities. Our experimental results on the Charades-STA dataset highlight the potential of this approach in advancing the field of temporal activity localization and video understanding.

llm, multimodal llm, text-based llm, (13 more...)

arXiv.org Artificial Intelligence

2407.06157

Country: North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Zero-shot Building Age Classification from Facade Image Using GPT-4

Zeng, Zichao, Goo, June Moh, Wang, Xinglei, Chi, Bin, Wang, Meihui, Boehm, Jan

arXiv.org Artificial IntelligenceApr-15-2024

A building's age of construction is crucial for supporting many geospatial applications. Much current research focuses on estimating building age from facade images using deep learning. However, building an accurate deep learning model requires a considerable amount of labelled training data, and the trained models often have geographical constraints. Recently, large pre-trained vision language models (VLMs) such as GPT-4 Vision, which demonstrate significant generalisation capabilities, have emerged as potential training-free tools for dealing with specific vision tasks, but their applicability and reliability for building information remain unexplored. In this study, a zero-shot building age classifier for facade images is developed using prompts that include logical instructions. Taking London as a test case, we introduce a new dataset, FI-London, comprising facade images and building age epochs. Although the training-free classifier achieved a modest accuracy of 39.69%, the mean absolute error of 0.85 decades indicates that the model can predict building age epochs successfully albeit with a small bias. The ensuing discussion reveals that the classifier struggles to predict the age of very old buildings and is challenged by fine-grained predictions within 2 decades. Overall, the classifier utilising GPT-4 Vision is capable of predicting the rough age epoch of a building from a single facade image without any training.

age epoch, gpt-4 vision, prediction, (15 more...)

arXiv.org Artificial Intelligence

2404.09921

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > Wales (0.04)
Europe > United Kingdom > Scotland (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry:

Energy (0.69)
Banking & Finance > Real Estate (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding

Sun, Qi, Cui, Xiao, Zhou, Wengang, Li, Houqiang

arXiv.org Artificial IntelligenceJan-15-2024

In this study, we tackle the challenge of classifying the object category in point clouds, which previous works like PointCLIP struggle to address due to the inherent limitations of the CLIP architecture. Our approach leverages GPT-4 Vision (GPT-4V) to overcome these challenges by employing its advanced generative abilities, enabling a more adaptive and robust classification process. We adapt the application of GPT-4V to process complex 3D data, enabling it to achieve zero-shot recognition capabilities without altering the underlying model architecture. Our methodology also includes a systematic strategy for point cloud image visualization, mitigating domain gap and enhancing GPT-4V's efficiency. Experimental validation demonstrates our approach's superiority in diverse scenarios, setting a new benchmark in zero-shot point cloud classification.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2401.07572

Genre: Research Report > New Finding (0.34)

Industry: Energy > Oil & Gas > Upstream (0.70)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback